Abstract: In this paper, we consider a synchronization problem
between nodes A and B that are connected through a two-way
communication channel. Node A contains a binary file X of
length n and node B contains a binary file Y that is generated by
randomly deleting bits from X at a small deletion rate $\beta$. The
location of deleted bits is not known to either node A or node
B. We offer a deterministic synchronization scheme between
nodes A and B that needs $O(n\beta \log\frac{1}{\beta})$ ¹ ² transmissions
and reconstructs X at node B with probability of error that
is exponentially low in the size of X. The rate of our scheme
matches the optimal rate for this channel. Our scheme can be
extended to other editing models, e.g., insertions, repetitions or
replacements, as long as the rate of editing is small.

Keywords: Two-way communication, deletion channel, 
synchronization, edits, coding for synchronization. 

I. Introduction 

Consider two nodes A and B that respectively hold files 
X and Y, where file Y can be derived from file X by some 
deletions. For instance let 

X = 0010110001010111, and 

D D DD D 

Y = 00010001011. 

Here Y is derived from X by 5 deletions, where deleted bits 
are denoted by D. We call Y a deleted version of X. Suppose 
that the locations of deleted bits are unknown to both nodes. 
In this paper we are interested in the following question: 

• What is the optimal transmission protocol for synchronizing
the content of node B with the content of node A,
i.e., how to reconstruct an estimate of file X at node B?
By way of optimality, we are mainly concerned with the 
number of transmitted symbols between two nodes and the 
complexity of implementing the protocol at nodes A and B. 
Also, as usual, we desire the reconstructed estimate of X at 
node B to have symbol error probability that is exponentially 
small in the size of X. 

Synchronization from deletions is a special case of a more 
general synchronization problem where file Y can be derived 
from X by a sequence of edits. An edit can refer to either 
deletion of a bit from file X or insertion of a new bit within 
X. File synchronization from random edits is the subject of 
many practical applications. Over the web, file updating is 
1 $f(n) \in O(g(n))$ if there exist positive constants $C$ and $n_0$ such that
$|f(n)| \le C g(n)$ for $n > n_0$.

2 All logarithms in this paper are in base 2. 



an application where a user or a server needs to synchronize 
its outdated version of a file with a newer version. The new 
updates of a file can usually be modeled as random edits of 
its content. As another example, consider a search engine that 
constantly updates its database in order to reflect the latest 
changes to the content of websites. Here as well, changes 
can be modeled by random edits to the content of websites. 
Another area of application is in distributed storage networks 
where several backup nodes store the same content and need 
to be regularly synchronized together. Mis-synchronization in 
storage devices can be due to mis-synchronized clock speeds 
of read and write heads of hard drives or crashes in random 
parts of the hard drive. 

A. Previous Work 

There has been a large body of research on synchronization 
from edits. In [1], Varshamov and Tenengolts offered a coding
scheme for recovery from one asymmetric error. Soon thereafter,
Levenshtein [2] showed that the scheme of Varshamov
and Tenengolts can be used for synchronization from one deletion
or one insertion. In [3], Orlitsky proved several bounds
on the minimum number of transmitted bits under a restricted 
number of communication rounds. He further showed that if 
the number of edits is known to the nodes A and B, then 
there is a protocol that can asymptotically achieve the number 
of transmitted bits identical to the case where node A has 
access to file Y. 

While the results of [3] are nonconstructive, several researchers
have provided explicit code constructions. Let n
denote the length of file X. For $\delta$ number of edits, Schwarz
et al. [4] devised a protocol based on hash functions that
needs O(Jlognlog^) transmitted bits. Cormode et al. 
[5] offered an $\epsilon$-error protocol with $c(\epsilon)\,\delta \log^3 n$ total transmitted
bits, where $c(\epsilon)$ is a constant that depends on the error $\epsilon$.
For the same setting, Evfimievski [6] devised a protocol
with the number of transmitted bits that is a polynomial in
$\log n$, $\log\frac{1}{\epsilon}$, and $\delta$. For an unknown, fixed number of edits $\delta$,
Orlitsky and Viswanathan [7] showed that the $\epsilon$-error optimal
protocol needs at most $\delta \log n + \log\frac{1}{\epsilon}$ transmitted bits. They
also provided an explicit synchronization protocol that needs
$2\delta \log n\,(\log n + \log\log n + \log\frac{1}{\epsilon} + \log\delta)$ transmitted bits.
More recently, Venkataramanan et al. [8] offered a synchronization
scheme that can correct $\delta = o(\frac{n}{\log n})$ ³ edits with
$(4c + 1)\delta \log n$ transmitted bits from node A to node B and

3 $f(n) \in o(g(n))$ if for every $\epsilon > 0$ there exists $n_0$ such that
$|f(n)| \le \epsilon |g(n)|$ for $n > n_0$.
$10(\delta - 1)$ transmitted bits from node B to node A for any
positive integer c. The error of the reconstruction is at most
$\frac{d \log n}{n^c}$, where d is the number of deleted bits in X, out of $\delta$
total edits.

In practice, RSYNC [9] is a popular UNIX application for
synchronizing between edited files. The RSYNC method can
in general be very inefficient, and the number of transmitted
bits can be exponentially larger than the optimal number. There
have been many improvements over the baseline approach. For
example, Suel et al. [10] proposed a protocol that can sometimes
save up to 50% of bandwidth over RSYNC. There are
also more specialized synchronization tools, such as VSYNC
[11], which synchronizes between video files.

B. Our Contribution 

While most of the previous work has concentrated on 
synchronizing from a fixed number of edits between the 
two files X and Y, in this paper we are interested in a 
more practical scenario, which is synchronizing from a fixed 
rate of edits between the two files. In this paper we only 
study synchronization from deletions and will discuss possible 
extensions to the more general case of deletions and insertions. 
More specifically, we consider synchronization between node
A and node B where node A has a binary string X that is
generated by an i.i.d. Bernoulli process of parameter $\frac{1}{2}$. Node
B has a binary string Y that is generated from X by randomly
and independently deleting bits of X with probability $\beta$ that
is very small. We are interested in an optimal transmission
protocol for synchronizing between nodes A and B when n,
the length of X, is large.

In order to evaluate a lower bound on the optimal number
of transmitted bits between nodes A and B, suppose that
node A has access to string Y. Then, the optimal number
of transmitted bits to node B needed for reconstructing X is
$H(X|Y)$, which is the conditional entropy of string X given
string Y. Ma et al. [12] considered a more general set-up
where the deletion pattern follows a stationary Markov chain.
By applying the result of [12] to our model, for small values
of $\beta$, the entropy $H(X|Y)$ can be estimated as follows ⁴

$$H(X|Y) = n\left(\beta \log\frac{1}{\beta} + O(\beta)\right). \tag{1}$$

Therefore, any synchronization protocol needs at least
$n\beta \log\frac{1}{\beta}$ transmitted bits. Paper [12] further uses tools from the
well-studied problem of source coding with side information
[13], [14] to show that there exists a randomized synchronization
protocol on a one-way channel that asymptotically
needs $H(X|Y)$ transmitted bits. However, [12] does not offer
any explicit, deterministic construction for the synchronization
protocol. Notice that the most efficient previous constructions
(e.g., [8]) for a fixed number of edits $\delta$ require $O(\delta \log n)$
transmitted bits between A and B. A naive application of such
results to our setup would require $O(n\beta \log n)$ transmitted bits
between A and B for large n, which is clearly far from being
optimal.



4 $f(\beta) \in O(g(\beta))$ if there exist positive constants $C$ and $\beta_0$ such that
$|f(\beta)| \le C |g(\beta)|$ for $\beta < \beta_0$.
In this paper, we offer the first explicit and deterministic
construction of a protocol for synchronizing from a small rate
of deletions on a two-way channel. The protocol is optimal
within a constant multiplicative factor and needs $O(n\beta \log\frac{1}{\beta})$
transmitted bits. Furthermore, we demonstrate that the error
probability of synchronization at node B is exponentially small
in n. Finally, we show that our scheme needs a running time
that is at most $O(n^4 \beta^6)$.

The rest of the paper is organized as follows. In Section
II we present the problem setting and the main result along
with a sketch of our synchronization scheme. In Section III
we present the mathematical details of our synchronization
protocol and the proof of the main result in the paper. Section
IV discusses practical implications of our protocol for
low-complexity synchronization algorithms, and Section V includes
concluding remarks and directions for possible extensions.

II. Problem Setting and the Main Result 

A. Preliminaries 

We represent a binary string Z of length $\ell$ by $Z =
Z(1), Z(2), \cdots, Z(\ell)$. For $1 \le i < j \le \ell$, $Z(i, j)$ denotes
the substring $Z(i), Z(i+1), \cdots, Z(j)$ of Z. If $Z_1$ is a string
of length $\ell_1$ and $Z_2$ is a string of length $\ell_2$, we denote by
$Z_1, Z_2$ the string of length $\ell_1 + \ell_2$ obtained by concatenation
of $Z_1$ and $Z_2$. For a string Z we let $|Z|$ denote the length of
Z.

A deletion channel is a channel that deletes a subset of the 
bits of the input string. A deletion channel is characterized 
by a deletion pattern D which is a binary string of the same 
length as the input string. Let X be the input to the deletion 
channel with deletion pattern D and Y be the output of the 
channel. The deletion channel deletes bit X(i) from the input 
X if D(i) = 1 and transmits the bit X(i) if D(i) = 0. For 
example, the output of a deletion channel with input X = 101 
and deletion pattern D = 010, is Y = 11. 
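
The deletion channel defined above can be sketched in a few lines. This is a minimal illustration only; the function name `deletion_channel` and the representation of X and D as Python strings of `0`/`1` characters are assumptions made for the example, not part of the paper.

```python
def deletion_channel(x: str, d: str) -> str:
    """Return the output Y of a deletion channel with input X and pattern D."""
    if len(x) != len(d):
        raise ValueError("X and D must have the same length")
    # Bit X(i) is transmitted when D(i) = 0 and deleted when D(i) = 1.
    return "".join(bit for bit, flag in zip(x, d) if flag == "0")


# The example from the text: X = 101 and D = 010 give Y = 11.
print(deletion_channel("101", "010"))
```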

Corresponding to the deletion pattern D, we define a
function $f_D$ which maps the indices of bits in the input string
to their corresponding indices in the output string. If for index
i, $D(i) = 0$, then $f_D(i) = i - \sum_{j<i} D(j)$, and if $D(i) = 1$,
then $f_D(i) = f_D(i')$, where $i'$ is the largest index, smaller than
i, for which $D(i') = 0$.
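
As a small illustration of the index map, the sketch below computes $f_D(1), \cdots, f_D(n)$ in one pass. The function name `index_map` is hypothetical. The text does not say what $f_D(i)$ should be when bit i is deleted and no earlier bit survives; returning 0 in that case is an assumption of this sketch.

```python
from typing import List


def index_map(d: str) -> List[int]:
    """Return [f_D(1), ..., f_D(n)] for a deletion pattern D given as a 0/1 string."""
    result = []
    kept = 0  # number of undeleted bits seen so far
    for flag in d:
        if flag == "0":
            # For an undeleted bit, i minus the deletions before i is
            # exactly the count of undeleted bits up to and including i.
            kept += 1
        # A deleted bit inherits the output index of the last undeleted bit
        # before it (0 if there is none, which is an assumption).
        result.append(kept)
    return result


print(index_map("010"))
```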

B. The Main Result 

Suppose that node A contains a file that is represented by 
a binary string X of length n. Let node B contain a file Y of 
length m that is the output of a deletion channel with input X 
and deletion pattern D. We assume that the deletion pattern is 
unknown to nodes A and B. Suppose that the source file X is
generated by an i.i.d. Bernoulli source of parameter $\frac{1}{2}$ and that
the deletion channel deletes bits of X independently and with
probability $\beta \ll 1$. We are interested in a synchronization
protocol on a two-way, error-free channel between nodes A 
and B so that node B can recover string X from string Y with 
a small probability of error at the end of the communication 
session. Our main contribution in this paper is proving the 
following theorem. 
Theorem 1. There exists a deterministic synchronization
protocol between nodes A and B on a two-way channel
that on average transmits $O(n\beta \log\frac{1}{\beta})$ bits and generates an
estimate $\hat{X} = \hat{X}(1), \cdots, \hat{X}(n)$ of X at node B such that
$\Pr\{\hat{X}(i) \ne X(i)\} < 2^{-\Omega(n)}$ for every $1 \le i \le n$. ⁵

We prove the theorem by explicitly constructing a syn- 
chronization protocol. Next, we provide an overview of our 
synchronization protocol and prove its optimality. 

C. Synchronization Protocol 

Recall that node B has string Y which is a deleted version 
of string X. We next explain a synchronization protocol that 
enables node B to reconstruct an estimate of string X with 
a small probability of error. The synchronization protocol 
has three main steps as illustrated in Figure 1. Each step
is performed by a module at node B that has a two-way
communication link to node A. The three modules work in
serial, such that the input to the first module is string Y and
the output of the last module is the estimate $\hat{X}$ of string X.
Suppose that X is divided into substrings as follows

$$X = T_1, S_1, T_2, S_2, \cdots, T_{k-1}, S_{k-1}, T_k,$$

where $|S_i| = L_S$ and $|T_i| = L_T$. Substrings $S_1, \cdots, S_{k-1}$ are
called pivot strings and substrings $T_1, \cdots, T_k$ are called data
strings. We choose $L_T = \frac{1}{\beta}$ and $L_S = O(\log\frac{1}{\beta})$, and both
node A and node B know the exact values of $L_T$ and $L_S$.
Note that the length of a pivot string is much smaller than the
length of a data string. We will determine the exact value of
$L_S$ later during our analysis.
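
The partition of X into alternating data and pivot strings can be sketched as follows. The function name `partition` is hypothetical, and the handling of leftover bits when n is not exactly $kL_T + (k-1)L_S$ (they are appended to the last data string) is an assumption; the text does not discuss that case.

```python
from typing import List, Tuple


def partition(x: str, l_t: int, l_s: int) -> Tuple[List[str], List[str]]:
    """Split X into data strings T_1..T_k and pivot strings S_1..S_{k-1}."""
    data: List[str] = []
    pivots: List[str] = []
    pos = 0
    while True:
        data.append(x[pos:pos + l_t])
        pos += l_t
        # Stop when there is no room for a pivot followed by more data.
        if pos + l_s >= len(x):
            data[-1] += x[pos:]  # assumption: leftover bits join the last data string
            break
        pivots.append(x[pos:pos + l_s])
        pos += l_s
    return data, pivots


# Toy example with L_T = 4 and L_S = 2 on the 16-bit string from the introduction.
data, pivots = partition("0010110001010111", 4, 2)
print(data, pivots)
```

In this toy run the pivots are far too short to be useful; the analysis later fixes $L_S$ as a function of the deletion rate.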

1) The first step of the decoding process is performed by
the matching module at node B. In this step, node A
sends pivot strings $S_i$, $1 \le i \le k-1$, to node B. Upon
receiving the pivots, the matching module attempts to
determine the positions of the pivots in Y by finding
exact copies of the $S_i$'s within Y. Due to possible deletions
within the $S_i$'s, the matching module is able to find
exact matches for only a subset of the $S_i$'s. ⁶ Suppose that
the matching module finds matches for $S_{i_1}, \cdots, S_{i_{k'-1}}$,
where $k' \le k$. Based on the positions of the matched $S_i$'s, the
matching module divides Y into substrings as follows
and sends the result to the next module,

$$Y = P'_1, S_{i_1}, P'_2, S_{i_2}, \cdots, P'_{k'-1}, S_{i_{k'-1}}, P'_{k'}, \tag{2}$$

where $P'_j$ denotes the substring between matched pivots
$S_{i_{j-1}}$ and $S_{i_j}$ in Y.

2) The next step is performed by the deletion recovery
module at node B. After receiving the divided Y from
the matching module, the deletion recovery module
sends the indices $\{i_1, \cdots, i_{k'-1}\}$ of the matched pivots
in Y to node A. Upon receiving the indices, node A
divides X into substrings as follows:

$$X = P_1, S_{i_1}, P_2, S_{i_2}, \cdots, P_{k'-1}, S_{i_{k'-1}}, P_{k'}, \tag{3}$$

5 $f(n) \in \Omega(g(n))$ if there exist positive constants $C$ and $n_0$ such that
$f(n) \ge C g(n)$ for $n > n_0$.

6 We will explain later the other possible cases when there are multiple 
matches for a pivot but an error is made by detecting a match that is not due 
to the original pivot. 



where $P_j$ denotes the substring between pivots $S_{i_{j-1}}$
and $S_{i_j}$ in X and can be written as follows:

$$P_j = T_{i_{j-1}+1}, S_{i_{j-1}+1}, \cdots, S_{i_j - 1}, T_{i_j}.$$

Notice that if $S_{i_{j-1}}$ and $S_{i_j}$ are matched correctly in
Y, then the substring $P'_j$ of Y in (2) can be derived from $P_j$
by some sequence of deletions. In this step, nodes A and B use the
synchronization protocol of Venkataramanan et al. [8]
with parameter $c = 3$ ⁷ to recover from deleted bits of
$P_j$ and to form an estimate of $P_j$ for each $1 \le j \le k'$.
Let us denote by $\hat{P}_j$ the estimate of $P_j$ at the output
of the deletion recovery module. Notice that $\hat{P}_j$ has the
same length as $P_j$. At the end of this step, the deletion
recovery module forwards the string

$$\tilde{X} = \hat{P}_1, S_{i_1}, \hat{P}_2, S_{i_2}, \cdots, \hat{P}_{k'-1}, S_{i_{k'-1}}, \hat{P}_{k'} \tag{4}$$

as an estimate of X to the next module.
3) At this step, the LDPC decoder module at node B
recovers from the errors made by the first two steps. At
the first step of the protocol, due to the potential existence
of multiple copies of each $S_i$ within Y, the matching
module may erroneously match $S_i$ at a wrong place.
Suppose $S_{i_j}$ is a pivot that the matching module has
matched at a wrong place. Then the substrings $P'_j$ and
$P'_{j+1}$ of Y may not be realizable by deleting subsets of
bits from $P_j$ and $P_{j+1}$, respectively. As a result, after the
deletion recovery module, the estimates $\hat{P}_j$ and $\hat{P}_{j+1}$
may be different from $P_j$ and $P_{j+1}$, respectively.
Furthermore, even if the matching
module has matched pivots $S_{i_{j-1}}$ and $S_{i_j}$ correctly in
Y and $P'_j$ is a deleted version of $P_j$, the protocol of
Venkataramanan et al. [8], used in the deletion recovery
module, could introduce additional errors.
Suppose that the total error of the first two synchronization
modules is bounded by $\zeta$,

$$\Pr\{\hat{P}_j \ne P_j\} \le \zeta.$$

We notice that the output of the deletion recovery
module, $\tilde{X}$, is in synchronization with X, in the sense
that $|\hat{P}_j| = |P_j|$ for each $1 \le j \le k'$ and hence $\tilde{X}(i)$ is
the estimate of $X(i)$ for each index $1 \le i \le n$. Since
the error rate for substrings $P_j$, $1 \le j \le k'$, is an upper
bound for the bit error rate in $\tilde{X}$, we find that

$$\Pr\{\tilde{X}(i) \ne X(i)\} \le \zeta.$$

As a result, $\tilde{X}$ can be modeled as an output of a
Binary Symmetric Channel (BSC) with crossover
probability at most $\zeta$. To recover from the errors of $\tilde{X}$ we
then use a powerful additive-error correction code. Our
choice is an LDPC decoder which receives parity check
bits of a systematic LDPC code [15]. If node A sends
a sufficient number of parity check bits to the LDPC
decoder module, as shown in [16], the output of the
decoder will be a string $\hat{X}$ with

$$\Pr\{\hat{X}(i) \ne X(i)\} < 2^{-\Omega(n)},$$

7 Here, c is an arbitrary constant that defines the tradeoff between
complexity of the protocol and the error in the output of the decoder.

Figure 1. Illustration of the synchronization protocol.



as previously stated in Theorem 1.

Next, we wish to estimate the total number of transmitted 
bits used by our synchronization protocol. We first establish 
a measure of the performance of the matching module of the 
decoder. 

Theorem 2. For $L_S \ge 11 + 2\log\frac{1}{\beta}$, there exists a matching
module that, with probability $1 - 2^{-\Omega(n)}$, matches a subset
$\{S_{i_1}, \cdots, S_{i_{k'-1}}\}$ of pivots $\{S_1, \cdots, S_{k-1}\}$ with
$k' = (1 - L_S\beta + 2\beta + o(\beta))k$ ⁸ such that the probability of error in
matching $S_{i_j}$ is at most $\beta + o(\beta)$.

We devote Section III to proving this theorem. For the rest
of our argument we set $L_S = 11 + 2\log\frac{1}{\beta}$, which is the
minimum value of $L_S$ required by Theorem 2.

Next we use Theorem 2 to estimate the total number of
transmitted bits needed by the synchronization protocol.

Lemma 1. On average, the total number of transmitted bits
of the synchronization protocol is $O(n\beta \log\frac{1}{\beta})$.

Proof: First notice that $k = \frac{n}{L_S + L_T} \approx n\beta$. The number
of transmitted bits in the first step is

$$(k - 1)L_S \approx 11n\beta + 2n\beta \log\frac{1}{\beta} = O\left(n\beta \log\frac{1}{\beta}\right).$$

At the second step, the protocol of Venkataramanan et al. [8]
for the recovery from deletions within each $P_j$, $1 \le j \le k'$,

8 $f(\beta) \in o(g(\beta))$ if for every $\epsilon > 0$ there exists $\beta_0$ such that
$|f(\beta)| \le \epsilon |g(\beta)|$ for $\beta < \beta_0$.



with parameter $c = 3$ needs $13\delta_j \log|P_j| + 10(\delta_j - 1)$ transmitted
bits, where $\delta_j := |P_j| - |P'_j|$ is the number of deleted
bits in $P_j$ and $P'_j$ is the corresponding substring of Y.
Therefore, the average number of transmitted bits
in the second step is less than

$$\mathbb{E}\left[\sum_{j=1}^{k'} \left(13\,\delta_j \log|P_j| + 10\,\delta_j\right)\right].$$

Notice that $\sum_{j=1}^{k'} \delta_j$ is the total number of deleted bits from X
and is on average $n\beta$ (recall that we assumed that no deletions
occurred in the matched pivots).

In Appendix I we show that $\mathbb{E}[\delta_j \log|P_j|] \le 16 + 8\log\frac{1}{\beta}$.
Therefore, the average number of transmitted bits in the
deletion recovery module is upper bounded by

$$13k'\left(16 + 8\log\frac{1}{\beta}\right) + 10n\beta \le 13n\beta\left(16 + 8\log\frac{1}{\beta}\right) + 10n\beta = O\left(n\beta \log\frac{1}{\beta}\right),$$

where we used the inequality $k' \le k \approx n\beta$.

For the last step, we would like to estimate the error $\zeta$ in
$\hat{P}_j$. By Theorem 2 the error probability in matching $S_{i_{j-1}}$
and $S_{i_j}$ is at most $\beta$ each. Since $P'_j$ is the common neighbor
of $S_{i_{j-1}}$ and $S_{i_j}$, with probability at most $2\beta$ the string
$P'_j$ is not a deleted version of $P_j$. Furthermore, the error
in the protocol of Venkataramanan et al. [8] for $c = 3$
is upper bounded by $\frac{\delta_j \log|P_j|}{|P_j|^3}$. Since $\mathbb{E}[\delta_j] = \beta L_T = 1$
and also $|P_j| = L_T = \frac{1}{\beta}$, the average probability of error
by the protocol of Venkataramanan et al. is upper bounded
by $\beta^3 \log\frac{1}{\beta} = o(\beta)$. Counting the error from the matching
module, we have $\Pr\{\hat{P}_j \ne P_j\} \le 2\beta + o(\beta)$, and therefore

$$\Pr\{\tilde{X}(i) \ne X(i)\} \le 2\beta + o(\beta).$$

In order to recover from errors induced by a BSC with
crossover probability of at most $2\beta + o(\beta)$, node A needs to
send $nH(2\beta + o(\beta)) = O(n\beta \log\frac{1}{\beta})$ ⁹ parity check bits to
node B.

Each of the three steps of the protocol transmits $O(n\beta \log\frac{1}{\beta})$
bits on average. Therefore the average number of
transmitted bits by the algorithm is $O(n\beta \log\frac{1}{\beta})$. ■
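
To get a feel for the constants in the proof of Lemma 1, the sketch below evaluates the three per-step bounds numerically and compares their sum with the lower bound $n\beta \log\frac{1}{\beta}$. It is only an illustration: the function names are hypothetical, the $o(\beta)$ terms are dropped, and $k'$ is replaced by $n\beta$ as in the proof.

```python
import math


def binary_entropy(t: float) -> float:
    """Binary entropy H(t) in bits, with H(0) = H(1) = 0."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return t * math.log2(1 / t) + (1 - t) * math.log2(1 / (1 - t))


def bit_budget(n: int, beta: float) -> dict:
    """Evaluate the per-step upper bounds from the proof of Lemma 1."""
    log_inv = math.log2(1 / beta)
    l_s = 11 + 2 * log_inv            # pivot length set after Theorem 2
    l_t = 1 / beta                    # data-string length
    k = n / (l_s + l_t)               # number of data strings, roughly n * beta
    pivots = (k - 1) * l_s            # step 1: sending the pivots
    recovery = 13 * n * beta * (16 + 8 * log_inv) + 10 * n * beta  # step 2
    parity = n * binary_entropy(2 * beta)  # step 3, o(beta) dropped (assumption)
    return {
        "pivots": pivots,
        "recovery": recovery,
        "parity": parity,
        "total": pivots + recovery + parity,
        "lower_bound": n * beta * log_inv,
    }


budget = bit_budget(10**6, 0.001)
for name, value in budget.items():
    print(f"{name:12s} {value:12.0f}")
```

The ratio between `total` and `lower_bound` is dominated by the deletion recovery step, which reflects the constant-factor gap that the order notation hides.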

In the next section we prove Theorem 2.

III. Proof of Theorem 2

In this section, we propose a construction of a matching
module such that for $L_S \ge 11 + 2\log\frac{1}{\beta}$, with probability
$1 - 2^{-\Omega(n)}$ the module matches $k'$ pivots, out of which at most
$\beta k$ pivots are matched erroneously. Since $\beta k = (\beta + o(\beta))k'$,
our construction implies an error of $\beta + o(\beta)$ in matching the
pivots, which is equivalent to the statement of Theorem 2.

We will frequently use the following concentration theorem 
in our argument: 

Theorem 3 (Hoeffding [17]). Let $p_0$ be the probability that a
biased coin shows heads. Then for every $\epsilon > 0$, the probability
that N tosses of the coin yield a number of heads between
$(p_0 - \epsilon)N$ and $(p_0 + \epsilon)N$ is at least $1 - 2e^{-2\epsilon^2 N}$.
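
As a quick sanity check of Theorem 3, the following sketch compares the exact binomial probability with the Hoeffding lower bound for a few parameter choices. The function names and the chosen parameters are illustrative assumptions; the small tolerance added before rounding only guards against floating-point error at the interval ends.

```python
import math


def exact_probability(p0: float, n_tosses: int, eps: float) -> float:
    """Exact probability that the number of heads lies in [(p0-eps)N, (p0+eps)N]."""
    lo = max(math.ceil((p0 - eps) * n_tosses - 1e-9), 0)
    hi = min(math.floor((p0 + eps) * n_tosses + 1e-9), n_tosses)
    return sum(
        math.comb(n_tosses, h) * p0**h * (1 - p0) ** (n_tosses - h)
        for h in range(lo, hi + 1)
    )


def hoeffding_lower_bound(n_tosses: int, eps: float) -> float:
    """Lower bound 1 - 2 exp(-2 eps^2 N) from Theorem 3."""
    return 1 - 2 * math.exp(-2 * eps**2 * n_tosses)


for p0, n_tosses, eps in [(0.3, 100, 0.1), (0.5, 200, 0.05), (0.9, 50, 0.2)]:
    print(p0, n_tosses, eps,
          round(exact_probability(p0, n_tosses, eps), 4),
          round(hoeffding_lower_bound(n_tosses, eps), 4))
```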



We will occasionally need a stronger version of the previous 
theorem: 

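Theorem 4 (Hoeffding [17]). Let $Z_1, \cdots, Z_N$ be i.i.d. random
variables with expected value M that take values in an interval
of length $\ell$. Then, for every $\epsilon > 0$, the following holds

$$\Pr\left\{\left|\sum_{i=1}^{N} Z_i - NM\right| > \epsilon N\right\} \le 2\exp\left(-\frac{2\epsilon^2 N}{\ell^2}\right).$$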



Recall that the string X is partitioned into substrings as $X =
T_1, S_1, \cdots, T_{k-1}, S_{k-1}, T_k$, where $|T_i| = L_T$ and $|S_i| = L_S$.
In our setup $L_T = \frac{1}{\beta}$, $L_S = O(\log\frac{1}{\beta})$, and $k \approx n\beta$. Let us
denote the index of the first bit of $S_i$ in X by $\underline{s}_i$ and the
index of the last bit of $S_i$ in X by $\bar{s}_i$. Similarly, the first
and last indices of $T_i$ are denoted by $\underline{t}_i$ and $\bar{t}_i$. Therefore,
$X(\underline{s}_i, \bar{s}_i) = S_i$ and $X(\underline{t}_i, \bar{t}_i) = T_i$.

The task of the matching module is to find "right matches"
of the $S_i$'s within string Y. Next, we formalize the notion of right
and wrong matches for a pivot $S_i$.

A. Right and Wrong Matches 

Consider the substring $D(\underline{s}_i, \bar{s}_i)$, which is the part of the
deletion pattern D that acts on the pivot $S_i$. We consider the
following cases:

• $D(\underline{s}_i, \bar{s}_i)$ is the all-zeros vector: There is no deletion
within $S_i$. In this case we call the copy of $S_i$ between
9 Here and in the remainder of the paper we use $H(\cdot)$ to refer to the binary
entropy function defined as follows: $H(t) = t \log\frac{1}{t} + (1 - t) \log\frac{1}{1-t}$ for
$0 < t < 1$.
Figure 2. Illustration of a matched $S_i$ with one deletion.



indices $f_D(\underline{s}_i)$ and $f_D(\bar{s}_i)$ of Y the right match of $S_i$.

All other copies of Si in Y are considered wrong matches 
of Si. 

• $D(\underline{s}_i, \bar{s}_i)$ has one nonzero element: There is one deletion
within $S_i$. In this case, if there is a copy of $S_i$ in Y
that begins at $f_D(\underline{s}_i)$ or ends at $f_D(\bar{s}_i)$ then we call it
a right match of $S_i$ and all other copies of $S_i$ are called
wrong matches of $S_i$. If there is no such copy of $S_i$
within Y, then all copies of $S_i$ within Y are called wrong
matches. Notice that in this case there are possibly two
right matches for $S_i$. For instance let $S_i = 000$ and let
the immediate undeleted bits before and after $S_i$ be zero.
Then it is easy to verify that after one deletion within $S_i$,
there is a copy of $S_i$ starting at $f_D(\underline{s}_i)$ in Y and there is
another copy of $S_i$ ending at $f_D(\bar{s}_i)$ in Y.

• $D(\underline{s}_i, \bar{s}_i)$ has more than one nonzero element: There is
more than one deletion within $S_i$. In this case all copies
of $S_i$ within Y are considered wrong matches.

While the definition of right and wrong matches is natural for
the case of no deletion within $S_i$, we next explain the reason
behind the definition for the case with deletions within $S_i$.
Consider the illustration in Figure 2 where $S_i = 01101000$.
Assume the penultimate bit is deleted from it. Suppose that
the bit right after $S_i$ is 0. Notice that even with the deleted bit,
a copy of $S_i$ appears in Y, starting at $f_D(\underline{s}_i)$. This copy of $S_i$
is called a right match. The reason is that the resulting string
Y is the same as in the case where there is no deletion within
$S_i$ and instead the bit right after $S_i$ is deleted in X. In other words,
here we can "move" the deletion from $S_i$ to the substring $T_{i+1}$
without changing Y.
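
The "moving the deletion" argument can be checked directly on the example pivot 01101000. In the sketch below the bits placed around the pivot (`11` before, `0` and `1` after) are hypothetical neighbours chosen only for illustration; the text fixes only that the bit right after the pivot is 0.

```python
def delete_at(x: str, i: int) -> str:
    """Delete the bit at 1-indexed position i of x."""
    return x[: i - 1] + x[i:]


pivot = "01101000"
x = "11" + pivot + "0" + "1"   # hypothetical context; the bit after the pivot is 0
start = 3                      # 1-indexed position of the first pivot bit in x

y_inside = delete_at(x, start + 6)  # delete the penultimate bit of the pivot
y_after = delete_at(x, start + 8)   # delete the bit right after the pivot instead

print(y_inside, y_after)
# Both deletion patterns give the same Y, and a copy of the pivot
# still starts at the pivot's original position.
print(y_inside == y_after, y_inside[start - 1: start - 1 + len(pivot)] == pivot)
```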

Although a similar scenario may happen when there is
more than one deletion within $S_i$, i.e., we might be able to
move the deleted bits from $S_i$ to the neighboring data strings
without changing the resulting Y, the probability of
these cases is very small (the exact statement will follow), so
our analysis conservatively counts those matches as wrong
matches.

Next, we analyze the probability of occurrence of right
matches for $S_i$.

Lemma 2. With probability $1 - \beta L_S + o(\beta)$, $S_i$ has no
deletions and there is a right match for $S_i$ within Y.

Proof: With probability $(1 - \beta)^{L_S}$ no bit is deleted from
$S_i$. For $L_S = O(\log\frac{1}{\beta})$ we have

$$(1 - \beta)^{L_S} = 1 - L_S\beta + o(\beta).$$

■

Lemma 3. With probability $2\beta + o(\beta)$ there is one deletion
within $S_i$ and there is a right match for $S_i$ within Y.

Proof: Fix h as the place of the deleted bit out of the $L_S$ bits
of $S_i$. Suppose $S_i(h) = b \in \{0, 1\}$. It is simple to observe that
there is a copy of $S_i$ starting at $f_D(\underline{s}_i)$ in Y if and only if
$S_i(h, L_S) = b, b, \cdots, b$ and furthermore, the first undeleted
bit after $S_i$ in X is also b. In other words, the hth bit of $S_i$
should belong to the final "run" of zeros or ones of $S_i$ and the
first undeleted bit after $S_i$ should also be of the same value.
With probability $\beta(1 - \beta)^{L_S - 1}$, exactly the hth bit of $S_i$ is
deleted and with probability $2^{-(L_S - h + 1)}$ the bits after the hth bit
in $S_i$ and the first undeleted bit after $S_i$ have the same value
as the hth bit of $S_i$. The overall probability of this case is
$\beta(1 - \beta)^{L_S - 1}\, 2^{-(L_S - h + 1)}$. Similarly, there is a copy of $S_i$
finishing at $f_D(\bar{s}_i)$ in Y if and only if all bits before the hth
bit in $S_i$ and the first undeleted bit before $S_i$ are equal to
the hth bit of $S_i$. This case happens with probability
$\beta(1 - \beta)^{L_S - 1}\, 2^{-h}$. The intersection of the two events happens when
$S_i$ is the all-zeros or all-ones string and the immediate undeleted
bits before and after $S_i$ have the same value as the bits in $S_i$.
This case happens with probability $\beta(1 - \beta)^{L_S - 1}\, 2^{-(L_S + 1)}$. By
using the inclusion-exclusion principle and by varying h from
1 to $L_S$ we find the total probability of having one deletion
within $S_i$ and a right match for $S_i$ to be:

$$\sum_{h=1}^{L_S} \beta(1 - \beta)^{L_S - 1}\left(2^{-(L_S - h + 1)} + 2^{-h} - 2^{-(L_S + 1)}\right)
= \beta(1 - \beta)^{L_S - 1}\left(2 - 2^{1 - L_S} - L_S\, 2^{-(L_S + 1)}\right)
= 2\beta + o(\beta),$$

where in the last step we assumed $L_S = O(\log\frac{1}{\beta})$. ■
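
The closed form of the sum over h in the proof of Lemma 3 can be verified with exact rational arithmetic. This is only a check of the algebra; the function names below are hypothetical.

```python
from fractions import Fraction


def lemma3_sum(l_s: int) -> Fraction:
    """Sum over h of 2^-(L_S-h+1) + 2^-h - 2^-(L_S+1), computed term by term."""
    return sum(
        Fraction(1, 2 ** (l_s - h + 1)) + Fraction(1, 2**h) - Fraction(1, 2 ** (l_s + 1))
        for h in range(1, l_s + 1)
    )


def lemma3_closed_form(l_s: int) -> Fraction:
    """Closed form 2 - 2^(1-L_S) - L_S * 2^-(L_S+1)."""
    return 2 - Fraction(2, 2**l_s) - Fraction(l_s, 2 ** (l_s + 1))


for l_s in (1, 2, 8, 31):
    print(l_s, lemma3_sum(l_s) == lemma3_closed_form(l_s), float(lemma3_sum(l_s)))
```

For pivot lengths of a few dozen bits the factor is already very close to 2, which is where the leading term $2\beta$ comes from.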

Lemma 4. With probability $o(\beta)$, $S_i$ has more than one
deletion.

Proof: Since the probability of no deletion within $S_i$ is
$(1 - \beta)^{L_S}$ and the probability of one deletion within $S_i$ is
$L_S\beta(1 - \beta)^{L_S - 1}$, the probability of more than one deletion
within $S_i$ is

$$1 - (1 - \beta)^{L_S} - L_S\beta(1 - \beta)^{L_S - 1} = o(\beta),$$

where we assumed $L_S = O(\log\frac{1}{\beta})$ in the final estimate. ■
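
The three probabilities used in Lemmas 2 to 4 are easy to tabulate for concrete values. The sketch below does so for a few deletion rates, rounding $L_S = 11 + 2\log\frac{1}{\beta}$ up to an integer; the rounding and the function name are assumptions of the example.

```python
import math


def deletion_count_probabilities(beta: float, l_s: int):
    """Probabilities of zero, exactly one, and more than one deletion in a pivot."""
    p_zero = (1 - beta) ** l_s
    p_one = l_s * beta * (1 - beta) ** (l_s - 1)
    return p_zero, p_one, 1 - p_zero - p_one


for beta in (1e-2, 1e-3, 1e-4):
    l_s = math.ceil(11 + 2 * math.log2(1 / beta))  # integer pivot length (assumption)
    p_zero, p_one, p_more = deletion_count_probabilities(beta, l_s)
    # p_more / beta shrinks as beta decreases, in line with the o(beta) claim.
    print(beta, l_s, round(p_zero, 5), round(p_one, 5), p_more / beta)
```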
Let us define $R := 1 - L_S\beta + 2\beta$. From the preceding
lemmas we conclude that:

Lemma 5. For a random string X and a random deletion
pattern D, on average the number of pivots with a right match
in Y is $(R + o(\beta))k$.

By applying Theorem 3 we conclude that:

Lemma 6. For a random string X and a random deletion
pattern D, with probability $1 - 2^{-\Omega(n)}$, there are $(R + o(\beta))k$
pivots with a right match in Y.



B. The Matching Graph 

The task of the matching module is to detect right matches
of the $S_i$'s within Y. For this purpose we use a graph-theoretic
method. We define a graph $G(V, E)$ with the vertex set as
follows. The graph G has $k + 1$ layers of vertices which are
denoted by $\Lambda_0, \Lambda_1, \cdots, \Lambda_k$. Each vertex in layer $\Lambda_i$,
$1 \le i \le k - 1$, represents a match of pivot $S_i$ in string Y. We refer to
the vertices of $\Lambda_i$ and matches of $S_i$ in Y interchangeably. For
a vertex $v \in \Lambda_i$, let $\underline{v}$ and $\bar{v}$ denote, respectively, the first and
the last indices of the match of $S_i$ corresponding to v in Y.
We introduce two auxiliary vertices s and t where $\Lambda_0 = \{s\}$
with $\underline{s} = \bar{s} = 0$ and $\Lambda_k = \{t\}$ with $\underline{t} = \bar{t} = |Y| + 1$. Vertices s and t
represent the beginning and the end of string Y.

We say a vertex in $\Lambda_i$ is a good vertex if it corresponds to
a right match of $S_i$ within Y. We call a vertex in $\Lambda_i$ a bad
vertex if it corresponds to a wrong match of $S_i$. By definition
of right and wrong matches, in each layer of graph G there are
possibly zero, one, or two good vertices. In order to detect the
right matches of the $S_i$'s within Y, we need to find good vertices
in graph G. For that, we define the edge set of G such that
the good vertices are distinguished by their connectivity in the
graph.

Let us define the distance between two vertices u and v in
G as follows:

$$\mathrm{Dis}(u, v) := \underline{v} - \bar{u} - 1.$$

Notice that $\mathrm{Dis}(u, v)$ is nonnegative only when the first bit of
v appears after the last bit of u. In that case, $\mathrm{Dis}(u, v)$ is the
number of bits between u and v in Y.

For two pivots $S_i$ and $S_j$ with $i < j$, the number of bits
between them in X is given by

$$(j - i - 1)L_S + (j - i)L_T.$$

If both $S_i$ and $S_j$ have right matches in Y, the number of bits
between the right match for $S_i$ and the right match for $S_j$ is
at most $(j - i - 1)L_S + (j - i)L_T$.

Furthermore, in most cases, for $i < j$, the first bit of the
right match for $S_j$ appears after the last bit of the right match
for $S_i$. To see this, notice that, since the first bit of $S_j$ appears
after the last bit of $S_i$ in X, if there are no deletions within
$S_i$ and $S_j$, their order is preserved in Y.

However, for instance let $S_i = 0000$ and $S_j = 0000$. Also
assume that all bits between $S_i$ and $S_j$ are deleted except
for a single bit, and that exactly one bit is deleted from
$S_i$ and exactly one bit is deleted from $S_j$. In this case, the
compound substring of Y corresponding to $S_i$ and $S_j$ and the
bits in between them in X is 0000000, where the first four
bits constitute the right match for $S_i$ and the last four bits
constitute the right match for $S_j$. As we can observe, the first
bit of the right match of $S_j$ is the last bit of the right match
of $S_i$. The distance between the right match for $S_i$ and the
right match for $S_j$ is $-1$. It is easy to verify that in general
for $j > i$, the least value of the distance between the right
match of $S_i$ and the right match of $S_j$ is $-1$.

Based on the two preceding observations, we connect a
vertex $u \in \Lambda_i$ to a vertex $v \in \Lambda_j$, $i < j$, if and only if

$$-1 \le \mathrm{Dis}(u, v) \le (j - i - 1)L_S + (j - i)L_T. \tag{5}$$
Figure 3. A graph G with 8 layers of vertices. The horizontal
axis indicates the different layers and the vertical axis indicates the position of each
vertex in string Y. The good and bad vertices are distinguished by blue and
red colors respectively. The first layer has only one vertex s and the last
layer has only one vertex t. As can be seen, all good vertices in the graph are
connected together and they form an s-t path which is indicated by the
dashed edges in the graph.



Therefore all pairs of good vertices are connected together.
By definition, s and t, which indicate the beginning and the
end of string Y respectively, are treated as "auxiliary" good
vertices. Therefore, good vertices across different layers form
an s-t path in graph G. However, there are potentially many
other pairs of vertices that satisfy the condition of Equation (5)
and are connected together. Figure 3 illustrates an instance of
graph G with 8 layers and the connections between vertices.
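
The distance and the edge rule of Equation (5) translate directly into code. In the sketch below the `Vertex` class and the function names are hypothetical, and positions in Y are 1-indexed as in the text. The example reproduces the overlapping matches of two all-zero pivots, for which the distance is -1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    layer: int  # i for a match of pivot S_i; 0 for s and k for t
    first: int  # index in Y of the first bit of the match
    last: int   # index in Y of the last bit of the match


def dis(u: Vertex, v: Vertex) -> int:
    """Dis(u, v): first index of v minus last index of u minus one."""
    return v.first - u.last - 1


def connected(u: Vertex, v: Vertex, l_s: int, l_t: int) -> bool:
    """Edge rule of Equation (5) for u in layer i and v in layer j with i < j."""
    if u.layer >= v.layer:
        return False
    gap = v.layer - u.layer
    return -1 <= dis(u, v) <= (gap - 1) * l_s + gap * l_t


# Two pivots 0000 whose right matches share one bit of the run 0000000 in Y.
u = Vertex(layer=1, first=1, last=4)
v = Vertex(layer=2, first=4, last=7)
print(dis(u, v), connected(u, v, l_s=4, l_t=10))
```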

The following theorem shows that with very high probabil- 
ity bad vertices do not contribute to an s — t path. That is, 
any s — t path of the appropriate length in graph G is formed 
mostly of good vertices. 

Theorem 5. For a random string X and a random deletion
pattern D, for $L_S \ge 11 + 2\log\frac{1}{\beta}$, if we pick any path from s
to t with $Rk + o(\beta)k$ vertices, then with probability at least
$1 - 2^{-\Omega(n)}$ the path has at least $Rk - \beta k + o(\beta)k$ good vertices.

Theorem 5 is not only an existence statement, but also has
an algorithmic implication. The implication is that if we pick
any path from s to t with $Rk + o(\beta)k$ vertices, the path
has many good vertices. Since finding an s-t path of an
appropriate length in G is a computationally tractable task (we
will discuss the computational complexity in the next section),
finding a large fraction of good vertices is also a tractable task.
Next we prove the theorem. 

Proof: We begin by finding an upper bound on the probability of the existence of a path $Q$ from $s$ to $t$ with $Rk + o(\beta)k$ vertices out of which $\alpha k$ are bad vertices, for some $\beta \le \alpha \le 1$. There are $k + 1$ layers in graph $G$ and by Lemma 6 with probability $1 - 2^{-\Omega(n)}$ there are $Rk + o(\beta)k$ layers with good vertices in graph $G$. Let us fix the realization of the deletion pattern $D$ and the realization of the pivots $S_i$ in $X$ with exactly one deletion and the realization of the immediate undeleted bits before and after pivots $S_i$ in $X$ with exactly one deletion. In this way, good vertices of the graph $G$ are fixed. We consider two cases:



Case 1: $\beta \le \alpha \le \frac{1}{2}$

For $\beta \le \alpha \le \frac{1}{2}$, first we fix the layers which have a vertex on the path $Q$. Since by assumption there are $Rk - \alpha k + o(\beta)k$ good vertices on the path $Q$, the selection of good vertices on the path can be done in the following number of ways

$$\binom{Rk + o(\beta)k}{Rk - \alpha k + o(\beta)k} \cdot \binom{(1 - R)k + \alpha k + o(\beta)k}{\alpha k} \le 2^{k\left((R + o(\beta))\,H\left(\frac{\alpha}{R + o(\beta)}\right) + (1 - R + \alpha + o(\beta))\,H\left(\frac{\alpha}{1 - R + \alpha + o(\beta)}\right)\right)}, \tag{6}$$

where the first term of the multiplication stands for the number of ways we choose the layers with good vertices on the path $Q$ and the second term stands for the number of ways we can choose the layers with bad vertices from the remaining available layers.
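
The step from the binomial coefficients to the exponent in (6) rests on the standard estimate $\binom{n}{m} \le 2^{nH(m/n)}$, where $H$ is the binary entropy function. The Python sketch below checks this estimate numerically for small arguments; the helper names `binary_entropy` and `entropy_bound_holds` are illustrative only, and logarithms are assumed to be base 2.

```python
from math import comb, log2


def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def entropy_bound_holds(n: int, m: int) -> bool:
    """Check log2 C(n, m) <= n * H(m / n), up to floating-point slack."""
    return log2(comb(n, m)) <= n * binary_entropy(m / n) + 1e-9
```

Applying the estimate to each of the two binomial factors separately gives the two entropy terms in the exponent of (6).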

Suppose that path $Q$ has vertices from layers $A_{i_1}, A_{i_2}, \ldots, A_{i_{Rk+o(\beta)k}}$. Let $\mathcal{I} := \{1, \ldots, Rk + o(\beta)k\}$ be the set of indices of the layers with a vertex on the path $Q$. Let $\mathcal{I} = \mathcal{I}_g \cup \mathcal{I}_b$ where $\mathcal{I}_g$ is the set of indices of layers with good vertices in $Q$ and $\mathcal{I}_b$ is the set of indices of layers with bad vertices in $Q$ (the sets $\mathcal{I}_g$ and $\mathcal{I}_b$ are disjoint). That is, a layer $A_{i_j}$ with $j \in \mathcal{I}_g$ is a layer with a good vertex in $Q$ and a layer $A_{i_j}$ with $j \in \mathcal{I}_b$ is a layer with a bad vertex in $Q$.

Let us express the path $Q$ as $s - v_{i_1} - v_{i_2} - \cdots - v_{i_{Rk+o(\beta)k}} - t$ where $v_{i_j} \in A_{i_j}$. The path $Q$ is uniquely identified by the position of the first bit of its vertices, $(v_{i_1}, v_{i_2}, \ldots, v_{i_{Rk+o(\beta)k}})$. Equivalently, if we know the distance between consecutive vertices $(\mathrm{Dis}(v_{i_j}, v_{i_{j+1}}) : j \in \mathcal{I})$, we can uniquely identify the position of each vertex on the path. Therefore, next we count the number of possible values of the distances between consecutive vertices $(\mathrm{Dis}(v_{i_j}, v_{i_{j+1}}) : j \in \mathcal{I})$.

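Since good vertices are pinned down on the path, the value of $\mathrm{Dis}(v_{i_j}, v_{i_{j+1}})$ is determined if both $v_{i_j}$ and $v_{i_{j+1}}$ are good vertices. Let us define the set $\mathcal{H} \subset \mathcal{I}$ as follows

$$\mathcal{H} = \{j : j \in \mathcal{I}_b \lor (j + 1) \in \mathcal{I}_b\}.$$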

Therefore $(\mathrm{Dis}(v_{i_j}, v_{i_{j+1}}) : j \in \mathcal{H})$ is the set of distances between consecutive vertices of $Q$ that are undetermined. The number of bad vertices on $Q$ is $\alpha k$. Therefore $|\mathcal{H}| \le 2\alpha k$. Let $j_1, j_2$ with $j_1 < j_2$ be two consecutive elements of $\mathcal{I}_g$. Then by additivity of distances, bad vertices $v_{i_{j_1+1}}, \ldots, v_{i_{j_2-1}}$ need to satisfy the following constraint:

$$\sum_{t=j_1}^{j_2-1} \mathrm{Dis}(v_{i_t}, v_{i_{t+1}}) = \mathrm{Dis}(v_{i_{j_1}}, v_{i_{j_2}}) - (j_2 - j_1 - 1)L_S, \tag{7}$$

where $(j_2 - j_1 - 1)L_S$ is the total length of the substrings $v_{i_{j_1+1}}, \ldots, v_{i_{j_2-1}}$ in $Y$. Furthermore, bad vertices should be placed on $Q$ such that they satisfy the constraint given in (5). For every $j \in \mathcal{H}$, we need to have

$$-1 \le \mathrm{Dis}(v_{i_j}, v_{i_{j+1}}) \le (i_{j+1} - i_j)L_T + (i_{j+1} - i_j - 1)L_S. \tag{8}$$

Next we find an upper bound on the number of integer vectors $(\mathrm{Dis}(v_{i_j}, v_{i_{j+1}}) : j \in \mathcal{H})$ that satisfy (7) and (8). For $j \in \mathcal{H}$ we use the following change of variables

$$\delta_j := (i_{j+1} - i_j)L_T + (i_{j+1} - i_j - 1)L_S - \mathrm{Dis}(v_{i_j}, v_{i_{j+1}}).$$

Equation (7) in terms of the variables $\delta_j$ is written as follows. For $j_1 < j_2$, any two consecutive elements in the ordered version of $\mathcal{I}_g$, we have

$$\sum_{j=j_1}^{j_2-1} \delta_j = (i_{j_2} - i_{j_1})L_T + (i_{j_2} - i_{j_1} - 1)L_S - \mathrm{Dis}(v_{i_{j_1}}, v_{i_{j_2}}). \tag{9}$$

Observe that in Equation (9), $(i_{j_2} - i_{j_1})L_T + (i_{j_2} - i_{j_1} - 1)L_S$ is the number of bits between $S_{i_{j_1}}$ and $S_{i_{j_2}}$ in $X$ and $\mathrm{Dis}(v_{i_{j_1}}, v_{i_{j_2}})$ is the number of bits between the right match of $S_{i_{j_1}}$ and the right match of $S_{i_{j_2}}$ in $Y$. Therefore, the right hand side of Equation (9) is the number of deleted bits in the substring between $S_{i_{j_1}}$ and $S_{i_{j_2}}$ in $X$. To find an upper bound on the number of solutions for $(\delta_j : j \in \mathcal{H})$, we relax constraints (9) over all $j$'s into a single constraint by adding them together:

$$\sum_{j \in \mathcal{H}} \delta_j = \delta - \sum_{j' \in \mathcal{H}^c} \delta_{j'}. \tag{10}$$



Here $\delta$ is the total number of deleted bits from $X$. The set $\mathcal{H}^c = \mathcal{I} \setminus \mathcal{H}$ is the set of indices $j'$ for which $\delta_{j'}$ is determined; i.e., $v_{i_{j'}}$ and $v_{i_{j'+1}}$ are both good vertices. Furthermore, $\delta_{j'}$ is the number of deleted bits from the substring between $S_{i_{j'}}$ and $S_{i_{j'+1}}$ in $X$.
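
The change of variables from distances to deletion counts, and the way the per-hop counts add up as in (9), can be checked on a small numeric instance. The Python sketch below is illustrative only: the function names and the example numbers are hypothetical, and it assumes that the end-to-end distance between two vertices equals the sum of the consecutive distances plus the lengths of the intermediate vertices, as in (7).

```python
def deletions_from_distances(layers, dists, L_S, L_T):
    """delta_j = (i_{j+1}-i_j) L_T + (i_{j+1}-i_j-1) L_S - Dis(v_{i_j}, v_{i_{j+1}}).

    layers : increasing layer indices i_1 < i_2 < ... of the vertices on a path
    dists  : distances between consecutive vertices (one fewer than layers)
    """
    return [
        (b - a) * L_T + (b - a - 1) * L_S - d
        for a, b, d in zip(layers, layers[1:], dists)
    ]


def end_to_end_distance(dists, L_S):
    """Distance between the first and last vertex, by additivity as in (7)."""
    return sum(dists) + (len(dists) - 1) * L_S


def bits_between_pivots(i_first, i_last, L_S, L_T):
    """Number of bits of X between pivots S_{i_first} and S_{i_last}."""
    return (i_last - i_first) * L_T + (i_last - i_first - 1) * L_S
```

For layers `[2, 3, 5, 6]`, distances `[9, 22, 10]`, `L_S = 4` and `L_T = 10`, the per-hop counts are `[1, 2, 0]`, and their sum equals the number of bits between the first and last pivot in $X$ minus the end-to-end distance in $Y$.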

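Next we use the following result on the concentration of $\sum_{j \in \mathcal{H}} \delta_j$ around its expected value.

Lemma 7. For a random string $X$ and a random deletion pattern $D$ the following bound holds:

$$\Pr\left\{\left|\sum_{j \in \mathcal{H}} \delta_j - \mathbb{E}\left[\sum_{j \in \mathcal{H}} \delta_j\right]\right| = o(\beta)k\right\} \ge 1 - 2^{-\Omega(n)}.$$

Proof: See Appendix II.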
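
To estimate $\mathbb{E}\left[\sum_{j \in \mathcal{H}} \delta_j\right]$, first notice that the average number of deleted bits from $X$ is $\mathbb{E}[\delta] = n\beta = (1 + o(\beta))k$.

Next we find $\mathbb{E}[\delta_{j'}]$ for $j' \in \mathcal{H}^c$. Since $Q$ has $Rk + o(\beta)k$ vertices, the average size of the substring between $S_{i_{j'}}$ and $S_{i_{j'+1}}$ in $X$ is $\frac{n}{Rk + o(\beta)k}$. Therefore, $\mathbb{E}[\delta_{j'}]$, the average number of deleted bits from the substring between $S_{i_{j'}}$ and $S_{i_{j'+1}}$ in $X$, is

$$\mathbb{E}[\delta_{j'}] = \frac{n\beta}{Rk + o(\beta)k} = \frac{1 + o(\beta)}{R + o(\beta)} = 1 + \beta L_S - 2\beta + o(\beta).$$

Since $|\mathcal{H}| \le 2\alpha k$ and $|\mathcal{I}| = |\mathcal{H}| + |\mathcal{H}^c| = (R + o(\beta))k$, we find that

$$|\mathcal{H}^c| \ge (R - 2\alpha + o(\beta))k = (1 - \beta L_S + 2\beta - 2\alpha + o(\beta))k.$$

We conclude that

$$\begin{aligned}
\mathbb{E}\left[\sum_{j \in \mathcal{H}} \delta_j\right] &= \mathbb{E}[\delta] - \mathbb{E}\left[\sum_{j' \in \mathcal{H}^c} \delta_{j'}\right] \\
&= k - |\mathcal{H}^c|\,\mathbb{E}[\delta_{j'}] + o(\beta)k \\
&\le k\left(1 - (1 - \beta L_S + 2\beta - 2\alpha)(1 + \beta L_S - 2\beta)\right) + o(\beta)k \\
&= 2\alpha k(1 + \beta L_S - 2\beta) + o(\beta)k,
\end{aligned}$$

and therefore by Lemma 7 with probability at least $1 - 2^{-\Omega(n)}$

$$\sum_{j \in \mathcal{H}} \delta_j \le 2\alpha k(1 + \beta L_S - 2\beta) + o(\beta)k. \tag{11}$$

Therefore, we showed that Equation (7) yields the weaker constraint in (11) on the vector $(\delta_j : j \in \mathcal{H})$.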

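Now consider Inequality (8). We can rewrite it in terms of $\delta_j$ as follows

$$0 \le \delta_j \le (i_{j+1} - i_j)L_T + (i_{j+1} - i_j - 1)L_S + 1.$$

To find an upper bound on the number of solutions for $(\delta_j : j \in \mathcal{H})$, we relax the preceding constraint to $\delta_j \ge 0$.

Under the constraint that $\delta_j \ge 0$, the number of integer solutions for $(\delta_j : j \in \mathcal{H})$ under Condition (11) is at most

$$\binom{2\alpha k(1 + \beta L_S - 2\beta) + o(\beta)k + |\mathcal{H}| - 1}{|\mathcal{H}| - 1} \le \binom{2\alpha k\left(2 + \beta L_S - 2\beta + \frac{o(\beta)}{\alpha}\right)}{2\alpha k}, \tag{12}$$

where the last estimate holds for sufficiently small $\beta$. Since $\binom{a}{b} \le 2^a$, the right hand side of (12) is at most $2^{5\alpha k}$ for sufficiently small $\beta$, which accounts for the $5\alpha$ term in the exponent $A_\alpha$.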
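
The binomial coefficient on the left of (12) has the form $\binom{N + m - 1}{m - 1}$, which is the classical count of non-negative integer vectors of length $m$ whose entries sum to $N$. The Python sketch below compares that closed form against brute-force enumeration for small values; the function names are illustrative and the enumeration is only practical for tiny inputs.

```python
from itertools import product
from math import comb


def count_solutions_brute(total: int, m: int) -> int:
    """Count vectors (d_1, ..., d_m) of non-negative integers summing to total."""
    return sum(1 for v in product(range(total + 1), repeat=m) if sum(v) == total)


def count_solutions_formula(total: int, m: int) -> int:
    """Closed-form count C(total + m - 1, m - 1)."""
    return comb(total + m - 1, m - 1)
```

In the setting of (12), `total` plays the role of the bound in (11) and `m` plays the role of $|\mathcal{H}|$.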

Given the number of possibilities for path $Q$, we next compute the probability of occurrence of each realization of path $Q$. Since $X$ is generated by an i.i.d. Bernoulli source of parameter $\frac{1}{2}$, different substrings of $X$ are independent and the probability of any given realization of bad vertices as specified by the choice of the $\delta_j$'s is $2^{-L_S\alpha k}$. By applying the union bound on the probability of existence of individual paths, using Inequality (6) and Inequality (12), we conclude that the probability of the existence of a path $Q$ with $Rk + o(\beta)k$ total vertices and $\alpha k$ bad vertices is upper bounded by $2^{A_\alpha k}$ where

$$\begin{aligned}
A_\alpha :=\; & R\,H\!\left(\frac{\alpha}{R}\right) + (1 - R + \alpha)\,H\!\left(\frac{\alpha}{1 - R + \alpha}\right) + 5\alpha - \alpha L_S + o(\beta) \\
=\; & -\alpha\log\alpha + \alpha\log R - R\log\left(1 - \frac{\alpha}{R}\right) + \alpha\log\left(1 - \frac{\alpha}{R}\right) \\
& - \alpha\log\alpha + \alpha\log(1 - R + \alpha) - (1 - R)\log\left(1 - \frac{\alpha}{1 - R + \alpha}\right) + 5\alpha - \alpha L_S + o(\beta).
\end{aligned}$$



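Next we find an upper bound for $A_\alpha$. Since $R < 1$, then $\alpha\log R < 0$. Since $\alpha \le \frac{1}{2}$, for small enough $\beta$, $R > \alpha$. Therefore $\alpha\log\left(1 - \frac{\alpha}{R}\right) < 0$ and $\alpha\log(1 - R + \alpha) < 0$. Using the inequality $\log(1 + x) \le \frac{x}{\ln 2}$ for $|x| < 1$ we find that

$$-R\log\left(1 - \frac{\alpha}{R}\right) = R\log\left(1 + \frac{\alpha}{R - \alpha}\right) \le \frac{R\alpha}{(R - \alpha)\ln 2} \le \frac{2\alpha}{\ln 2},$$

where we used the fact that for small values of $\beta$, $R$ is close to 1 and $R - \alpha \ge \frac{1}{2}$. Also we have

$$-(1 - R)\log\left(1 - \frac{\alpha}{1 - R + \alpha}\right) = (1 - R)\log\left(1 + \frac{\alpha}{1 - R}\right) \le \frac{(1 - R)\alpha}{(1 - R)\ln 2} = \frac{\alpha}{\ln 2}.$$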
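
Therefore

$$\begin{aligned}
A_\alpha &\le o(\beta) - 2\alpha\log\alpha + \frac{2\alpha}{\ln 2} + \frac{\alpha}{\ln 2} + 5\alpha - \alpha L_S \\
&= \alpha\left(-2\log\alpha + \frac{3}{\ln 2} + 5 - L_S + \frac{o(\beta)}{\alpha}\right) \\
&\le \alpha(-2\log\alpha + 10 - L_S),
\end{aligned}$$

where we used $\frac{o(\beta)}{\alpha} \le \frac{o(\beta)}{\beta} \to 0$ as $\beta \to 0$.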



Case 2: $\frac{1}{2} \le \alpha \le R + o(\beta)$

We again seek to bound the probability of the existence of a path $Q$ from $s$ to $t$ with $Rk + o(\beta)k$ total vertices and $\alpha k$ bad vertices. Let the path $Q$ be denoted by $s - v_{i_1} - v_{i_2} - \cdots - v_{i_{Rk+o(\beta)k}} - t$ and let $\delta_j$ denote the number of deleted bits between vertices $v_{i_j}$ and $v_{i_{j+1}}$. Clearly, the sum of the $\delta_j$ is the total number of deletions in string $Y$. By Theorem 3 with probability at least $1 - 2^{-\Omega(n)}$ we have $\sum_{j=0}^{Rk+o(\beta)k} \delta_j = n\beta + n\beta\,o(\beta) = k(1 + o(\beta))$. The number of integer solutions for $\delta_j \ge 0$ under this constraint is

$$\binom{k + Rk + o(\beta)k - 1}{Rk + o(\beta)k - 1} \le \binom{2k}{k} \le 2^{2k}.$$

The probability for each solution of the $\delta_j$'s to represent a valid $s$-$t$ path is at most $2^{-L_S\alpha k}$. Therefore, an upper bound on the probability of existence of a path $Q$ in this case is $2^{(2 - L_S\alpha)k}$.

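Finally, putting both cases for the range of $\alpha$ together, the probability of the existence of a path $Q$ with $Rk + o(\beta)k$ vertices between $s$ and $t$ with at least $\beta k$ bad vertices can be upper bounded by the sum of two integrals:

$$\int_{\alpha=\beta}^{\frac{1}{2}} 2^{A_\alpha k}\,d\alpha k + \int_{\alpha=\frac{1}{2}}^{R+o(\beta)} 2^{(2 - L_S\alpha)k}\,d\alpha k \le \int_{\alpha=\beta}^{\frac{1}{2}} 2^{(-2\log\alpha + 10 - L_S)\alpha k}\,d\alpha k + \int_{\alpha=\frac{1}{2}}^{R+o(\beta)} 2^{(2 - L_S\alpha)k}\,d\alpha k.$$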

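If we pick $L_S \ge 11 + 2\log\frac{1}{\beta}$ then we find

$$(-2\log\alpha + 10 - L_S)\alpha k \le -\alpha k \le -\beta k$$

for $\beta \le \alpha \le \frac{1}{2}$. Also,

$$2 - L_S\alpha \le 2 - \frac{1}{2}\cdot 11 = -3.5$$

for $\alpha \ge \frac{1}{2}$.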
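
The two exponent bounds above can be sanity-checked numerically. The Python sketch below evaluates the Case 1 exponent on a grid of $\alpha$ values in $[\beta, \frac{1}{2}]$ with the smallest admissible $L_S$, and evaluates the Case 2 exponent at $\alpha = \frac{1}{2}$ with $L_S = 11$. It assumes base-2 logarithms (consistent with the binary entropy used in (6)); the function names and the grid resolution are illustrative choices.

```python
from math import log2


def case1_exponent(alpha: float, L_S: float) -> float:
    """(-2 log(alpha) + 10 - L_S) * alpha, the per-k exponent in Case 1."""
    return (-2 * log2(alpha) + 10 - L_S) * alpha


def case2_exponent(alpha: float, L_S: float) -> float:
    """2 - L_S * alpha, the per-k exponent in Case 2."""
    return 2 - L_S * alpha


def case1_bound_holds(beta: float, steps: int = 1000) -> bool:
    """Check case1_exponent(alpha) <= -alpha on a grid of [beta, 1/2]."""
    L_S = 11 + 2 * log2(1 / beta)  # smallest pivot length allowed by Theorem 5
    for t in range(steps + 1):
        alpha = beta + (0.5 - beta) * t / steps
        if case1_exponent(alpha, L_S) > -alpha + 1e-12:
            return False
    return True
```

The check succeeds because, with this choice of $L_S$, the bracket equals $-1 - 2\log\frac{\alpha}{\beta}$, which is at most $-1$ whenever $\alpha \ge \beta$.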

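Therefore, we can upper bound the sum of the two integrals by

$$\int_{\beta}^{\frac{1}{2}} k\,2^{-\beta k}\,d\alpha + \int_{\frac{1}{2}}^{R+o(\beta)} k\,2^{-3.5k}\,d\alpha \le \frac{k}{2}\left(2^{-\beta k} + 2^{-3.5k}\right) = 2^{-\Omega(n)}.$$

This yields the result. ■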
In order to verify the result of Theorem 5 in a practical setting, we have plotted graph $G$ for a randomly generated string $X$ and a randomly generated deletion pattern $D$ with parameter $\beta = 0.01$, for three values of $L_S$ in Figure 4. To avoid visual clutter, we have only plotted edges that connect vertices on two consecutive layers. As is clear from the figure, for small values of $L_S$, there are many edges in the graph and there are potentially many paths that connect $s$ to $t$ which do not share many vertices with the correct path. However, for larger values of $L_S$, the irrelevant edges disappear from the graph and the only path that remains is the one formed by good nodes of the graph. For $\beta = 0.01$, Theorem 5 states that $L_S \ge 11 + 2\log\frac{1}{\beta} \approx 17$ is sufficient for our purpose. In practice, we observe that values of $L_S$ around 8 are sufficient for distinguishing good vertices on graph $G$.

IV. Practical Implementation 

In this section we discuss practical implementation of our synchronization protocol, consisting of a matching module, a deletion recovery module, and an LDPC decoder module (see Figure 1).

For the deletion recovery module, we can implement the synchronization protocol of Venkataramanan et al. [8], which runs in time linear in $|P_j|$ for deletion recovery of each substring $P_j$, $1 \le j \le k'$. Therefore the overall complexity of the deletion recovery module is linear in $n$. For the LDPC decoder module there are many sophisticated encoding and decoding schemes (see [16], [17]) that need running time linear in $n$.

In this section we therefore focus on the implementation of the graph-based algorithm for the matching module explained in the previous section. The result of Theorem 5 indicates that to find a large number of right matches for pivots in the received string $Y$, it suffices to find an $s$-$t$ path with $Rk + o(\beta)k$ vertices in the matching graph $G$. We now argue that this problem can be cast as the well known "shortest path problem" in a directed graph, so it can be efficiently solved in polynomial time.

As the first step, we only keep the vertices in graph $G$ which have an edge to vertex $t$ and remove all other vertices. Since all good vertices are connected to vertex $t$, this step does not eliminate any good vertex from graph $G$. Let $\tilde{G}$ denote the resulting graph. As the second step, we find the longest $s$-$t$ path in $\tilde{G}$. Since all good vertices are connected together and form an $s$-$t$ path of length $Rk + o(\beta)k$, the longest path in $\tilde{G}$ has at least $Rk + o(\beta)k$ vertices. Finally, we modify the discovered path into a path with only $Rk + o(\beta)k$ vertices by keeping only the first $Rk + o(\beta)k$ vertices on the path. Since each vertex in the graph $\tilde{G}$ has an edge to vertex $t$, the resulting vertices from this step form a path with $Rk + o(\beta)k$ vertices from $s$ to $t$.

The only step of the above procedure which is computationally demanding is the second step for finding the longest $s$-$t$ path in $\tilde{G}$. Notice that since $G$ and hence $\tilde{G}$ are acyclic graphs, the longest $s$-$t$ path problem in $\tilde{G}$ can be reduced to the shortest $s$-$t$ path problem in $\tilde{G}$ by assigning weight $-1$ to each edge. The latter problem is solvable in time $O(|\tilde{G}|^2)$, for instance by Dijkstra's algorithm [18], where $|\tilde{G}|$ is the number of vertices in $\tilde{G}$. We upper bound $|\tilde{G}|$ by $|G|$. To approximate $|G|$, we notice that there are approximately $n\beta$ layers in graph $G$ and the number of vertices in layer $A_i$ is the number of copies of pivot $S_i$ in $Y$, which is approximately $2^{-L_S}|Y| = O(\beta^2 n)$. Therefore, $|G| \approx O((\beta^2 n) \cdot (n\beta)) = O(n^2\beta^3)$. We conclude that the complexity of matching pivots in graph $G$ is upper bounded by $O(|G|^2) = O(n^4\beta^6)$.
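
Because the matching graph is acyclic, one standard alternative for the second step is a dynamic program over a topological order of the vertices. The Python sketch below uses that technique in place of the shortest-path reduction described above, so it is an illustration rather than the procedure of the paper. It assumes vertices are numbered `0..n-1` so that every edge goes from a smaller to a larger index (which holds if vertices are numbered layer by layer); the function name `longest_st_path` and the input format are hypothetical.

```python
def longest_st_path(n, edges, s, t):
    """Return a longest s-t path (as a list of vertices) in a DAG, or [] if none.

    n     : number of vertices, numbered 0..n-1 in topological order
    edges : iterable of (u, v) pairs with u < v
    """
    preds = [[] for _ in range(n)]
    for u, v in edges:
        if not u < v:
            raise ValueError("edges must respect the topological numbering")
        preds[v].append(u)

    best = [0] * n         # best[v]: max number of vertices on an s-v path (0 = unreachable)
    parent = [None] * n    # predecessor of v on one such path
    best[s] = 1
    for v in range(n):     # all predecessors of v have smaller index, so they are final
        if v == s:
            continue
        for u in preds[v]:
            if best[u] > 0 and best[u] + 1 > best[v]:
                best[v] = best[u] + 1
                parent[v] = u

    if best[t] == 0:
        return []
    path, v = [], t
    while v is not None:   # walk back from t to s
        path.append(v)
        v = parent[v]
    return path[::-1]
```

This sketch visits every vertex and edge once, so its running time is linear in the number of vertices plus the number of edges of the input graph.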

V. Conclusions

In this paper we offered the first synchronization protocol 
for recovering from a small rate of deletions with an optimal 
order of transmitted bits. The main idea was to divide the 
synchronization problem into synchronization between shorter 
substrings of the source file and destination file. For that, 
our protocol sends equally spaced small substrings of the 
source file to the destination, and the destination then uses a
graph-theoretic algorithm to locate the short substrings within its file
with high accuracy. For synchronization between the shorter 
substrings we used existing protocols that recover from a small 
number of edits. We observed that the compound output of the 
first two steps can be modeled as an output of a BSC with 
a small error probability. This error can be recovered with a 
low bit error rate by using an LDPC coding scheme. 

While in this work we only considered recovering from i.i.d. patterns of deleted bits, there are many other interesting edit models that the ideas of this paper can be applied to. An immediate extension of our work is to synchronization from i.i.d. insertions. To explain an i.i.d. insertion process, let us consider an equivalent description of the i.i.d. deletion process considered in this paper. In the new description, the deletion pattern $D$ is described as an independent sequence of positive integers, where the integers alternately represent the lengths of zero and one runs in the deletion pattern $D$. It is easy to verify that if the integers are generated independently according to an appropriate geometric distribution, the result is an i.i.d. 0-1 deletion pattern. We can describe the insertion pattern in the same way by generating the run length sequence of the pattern. For the insertion pattern, each run of ones corresponds to an inserted substring of equal length generated by an i.i.d. Bernoulli process. Also, each run of zeros corresponds to a substring of the input string of equal length in the output. It is not hard to see that the solution of this paper for synchronization from deletions is directly applicable to solving the synchronization problem from random insertions.
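
The run-length description of deletion and insertion patterns can be made concrete with a small deterministic example. The Python sketch below is an illustration under stated assumptions: a pattern symbol 1 marks a deleted (respectively inserted) position, the run lengths are supplied by the caller rather than drawn from a geometric distribution, and all function names are hypothetical.

```python
def pattern_from_runs(runs, first_symbol=0):
    """Expand alternating run lengths into a 0-1 pattern.

    runs         : positive integers, lengths of alternating runs
    first_symbol : symbol (0 or 1) of the first run
    """
    pattern, symbol = [], first_symbol
    for length in runs:
        if length <= 0:
            raise ValueError("run lengths must be positive")
        pattern.extend([symbol] * length)
        symbol = 1 - symbol  # runs alternate between zeros and ones
    return pattern


def apply_deletions(x, pattern):
    """Drop the bits of x at positions where the pattern is 1."""
    return "".join(bit for bit, d in zip(x, pattern) if d == 0)


def apply_insertions(x, pattern, inserted):
    """Build the output: pattern 0 copies the next bit of x, 1 takes an inserted bit."""
    source, extra = iter(x), iter(inserted)
    return "".join(next(extra) if d == 1 else next(source) for d in pattern)
```

For example, the runs `[2, 1, 3, 1]` starting with a zero run give the pattern `0010001`, which turns `0010110` into `00011` under deletion.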

One can also consider more general patterns of deletions or insertions, e.g., 0-1 deletion (insertion) patterns that follow a Markov chain random process (see [12]).

Another interesting direction for the extension of this work 
is the design of synchronization protocols that are capable of 
recovering from a small rate of both deletions and insertions. 



While the deletion recovery module in our work, based on the algorithm by Venkataramanan et al. [8], is directly applicable to recovery from deletions and insertions, the main challenge is to extend the graph-theoretic algorithm for matching the pivot substrings in the received string $Y$ when there are both deletions and insertions. Again, many parts of our argument still hold for the new setting as long as the edits happen with small rates, while some technical parts may need to be modified. This extension is the focus of our current research.

There are some other aspects of our current research that can be modified into more efficient synchronization protocols. For example, our algorithm needs a small backward bandwidth from node B to node A in the deletion recovery module. This bandwidth is an inherent component of the synchronization protocol of Venkataramanan et al. [8]. It is of great interest to design protocols that can operate on forward links only. As proved by Orlitsky [3], design of optimal protocols for recovery from deletions on forward links implies optimal protocols for recovery from deletions and insertions. Furthermore, such protocols can be implemented as efficient channel codes for communicating over edit channels [12], [19], [20], [21].

Finally, from a practical perspective, it is interesting to design a more efficient implementation of the graph-theoretic matching algorithm which is at the heart of our matching module. While our algorithm runs in $O(n^4\beta^6)$ time, we believe that by exploiting the specific structure of the matching graph, and applying additional restrictions on the connectivity of the vertices of the graph, it is possible to considerably reduce the running time of the matching module and hence reduce the overall complexity of the synchronization protocol.
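
Appendix I

Here we evaluate $\mathbb{E}[\delta_j\log|P_j|]$.

$$\begin{aligned}
\mathbb{E}[\delta_j\log|P_j|] &= \sum_{l} \Pr\{|P_j| = l\}\,\mathbb{E}\left[\delta_j\log|P_j| \,\middle|\, |P_j| = l\right] \\
&= \sum_{l} \beta l\log l\,\Pr\{|P_j| = l\} \\
&= \mathbb{E}[\beta|P_j|\log|P_j|].
\end{aligned}$$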

Next we estimate $\mathbb{E}[\beta|P_j|\log|P_j|]$. Notice that $P_j$ is the substring of $X$ between $S_{i_{j-1}}$ and $S_{i_j}$. There are $(i_j - i_{j-1})$ data strings and $(i_j - i_{j-1} - 1)$ pivot strings between $S_{i_{j-1}}$ and $S_{i_j}$. Therefore

$$|P_j| = (i_j - i_{j-1})L_T + (i_j - i_{j-1} - 1)L_S.$$

There is a total of $k$ pivots, and $k'$ of them are matched by the matching module. Therefore, with probability $p := \frac{k'}{k} \approx (1 - L_S\beta + 2\beta)$ a pivot $S_i$ is matched. Furthermore, the probability that a pivot is matched is independent of other pivots. Thus, $(i_j - i_{j-1})$ has the following geometric distribution

$$\Pr\{i_j - i_{j-1} = r\} = p(1 - p)^{r-1}.$$

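Suppose $r \in \{1, 2, \ldots\}$ is a random variable distributed as above. If we upper bound $|P_j| \le (i_j - i_{j-1})(L_T + L_S)$, then

$$\begin{aligned}
\mathbb{E}[\beta|P_j|\log|P_j|] &\le \mathbb{E}[\beta r(L_T + L_S)\log(r(L_T + L_S))] \\
&= \beta(L_T + L_S)\,\mathbb{E}[r\log r + r\log(L_T + L_S)] \\
&\le 2\,\mathbb{E}[r^2] + 2\log(L_T + L_S)\,\mathbb{E}[r],
\end{aligned}$$

where we used the fact that $\beta(L_T + L_S) \le 2$ and $r\log r \le r^2$. We can write

$$\mathbb{E}[r] = \frac{1}{p}, \qquad \mathbb{E}[r^2] = \mathrm{Var}(r) + \mathbb{E}[r]^2 = \frac{2 - p}{p^2}.$$

Also, we use $\log(L_T + L_S) \le \log 2L_T \le 2\log\frac{1}{\beta}$ and find that

$$\mathbb{E}[\beta|P_j|\log|P_j|] \le \frac{4 - 2p}{p^2} + \frac{4}{p}\log\frac{1}{\beta} \le 16 + 8\log\frac{1}{\beta},$$

where we used the fact that $\frac{4 - 2p}{p^2} \le 16$ for $p \ge \frac{1}{2}$ (notice that $p \to 1$ as $\beta \to 0$).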
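
The two moments of the geometric distribution used above, and the numeric bound on $\frac{4 - 2p}{p^2}$, can be checked with a truncated sum. The Python sketch below is illustrative only; the function name `geometric_moments` and the truncation length are arbitrary choices, and the truncation error is negligible for the values of $p$ tested.

```python
def geometric_moments(p: float, terms: int = 10_000):
    """Approximate E[r] and E[r^2] for Pr{r} = p (1 - p)^(r - 1), r = 1, 2, ..."""
    first = second = 0.0
    tail = 1.0  # holds (1 - p)^(r - 1)
    for r in range(1, terms + 1):
        prob = p * tail
        first += r * prob
        second += r * r * prob
        tail *= 1 - p
    return first, second


def second_moment_term(p: float) -> float:
    """The quantity 2 E[r^2] = (4 - 2p) / p^2 appearing in the final bound."""
    return (4 - 2 * p) / p ** 2
```

At $p = \frac{1}{2}$ the term $\frac{4 - 2p}{p^2}$ equals 12, and it decreases as $p$ grows, so it stays below 16 on the whole range $p \ge \frac{1}{2}$.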



Appendix II 

Recall that $\delta_j$ is the number of deleted bits from the substring of $X$ between pivots $S_{i_j}$ and $S_{i_{j+1}}$. Let us denote by $\mathcal{L}_S$ the set of indices $l$ for which $S_l$ appears between $S_{i_j}$ and $S_{i_{j+1}}$ for some $j \in \mathcal{H}$. Similarly, let $\mathcal{L}_T$ denote the set of indices $l$ for which $T_l$ appears between $S_{i_j}$ and $S_{i_{j+1}}$ for some $j \in \mathcal{H}$. Let $\delta_{S_l}$ denote the number of deleted bits from $S_l$ and $\delta_{T_l}$ denote the number of deleted bits from $T_l$. We can write

$$\sum_{j \in \mathcal{H}} \delta_j = \sum_{l \in \mathcal{L}_S} \delta_{S_l} + \sum_{l \in \mathcal{L}_T} \delta_{T_l}.$$

Notice that the length of the interval that $\delta_{S_l}$ takes values from is $L_S = O(\log\frac{1}{\beta})$ and the length of the interval that $\delta_{T_l}$ takes values from is $L_T = \frac{1}{\beta}$. Next, by application of Theorem H
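we can write

$$\begin{aligned}
\Pr\left\{\left|\sum_{j \in \mathcal{H}} \delta_j - \mathbb{E}\left[\sum_{j \in \mathcal{H}} \delta_j\right]\right| = o(\beta)k\right\}
&\ge \Pr\left\{\left|\sum_{l \in \mathcal{L}_S} \delta_{S_l} - \mathbb{E}\left[\sum_{l \in \mathcal{L}_S} \delta_{S_l}\right]\right| = o(\beta)k\right\} \cdot \Pr\left\{\left|\sum_{l \in \mathcal{L}_T} \delta_{T_l} - \mathbb{E}\left[\sum_{l \in \mathcal{L}_T} \delta_{T_l}\right]\right| = o(\beta)k\right\} \\
&\ge \left(1 - 2\exp\left(-\frac{2o(\beta^2)k^2}{|\mathcal{L}_S|L_S^2}\right)\right) \cdot \left(1 - 2\exp\left(-\frac{2o(\beta^2)k^2}{|\mathcal{L}_T|L_T^2}\right)\right) \\
&\ge \left(1 - 2\exp\left(-\frac{2o(\beta^2)k}{O(\log^2\frac{1}{\beta})}\right)\right) \cdot \left(1 - 2\exp\left(-2\beta^2 o(\beta^2)k\right)\right) \\
&= \left(1 - 2^{-\Omega(n)}\right) \cdot \left(1 - 2^{-\Omega(n)}\right) = 1 - 2^{-\Omega(n)},
\end{aligned}$$

where in our derivation we used the fact that $|\mathcal{L}_S| \le k$ and $|\mathcal{L}_T| \le k$, since $\mathcal{L}_S$ and $\mathcal{L}_T$ are subsets of $\{1, \ldots, k - 1\}$.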



